1 Exercise 1

1.1 Scrutinize the data to assess structure and quality. Are there any improbable or problematic entries? Provide a summary of checks performed and edit the data so entries are valid and meaningful where editing is reasonable to do.

The data set is quite well structured and of good quality. However, there are a few problematic entries, and the following section removes these entries and modifies the data set so that it can be used easily for further analysis.

  1. Though the data is meant to be collected from the USA, there were a few entries where the company's location was given as the United Kingdom. These entries were removed (Job Title and Location):

     - Genomic Data Scientist (Stevenage, United Kingdom)
     - Scientist, Data, Methods and Analytics, Immuno-inflammation and Specialty Medicines (Stevenage, United Kingdom)
     - Scientist in Data, Methods, & Analytics (Brentford, United Kingdom)
     - Lead Data Analyst (Brentford, United Kingdom)

  2. Every company name had its rating appended to it, even though the rating was already provided in a separate column, so the ratings were stripped from the names.

  3. The company size column was of type character, so it was converted to a factor and its levels were set in order of employee count.

  4. The Revenue column quoted all values in USD, so the "USD" text was removed from the entries and the unit was added to the column name instead.

  5. The salary estimate column was very messy: it mixed several estimate types (Glassdoor estimate, employer estimate and per-hour estimate) and contained many overlapping ranges. The estimate type was separated from the estimate value, and the ranges were reconstructed so that the number of distinct ranges is minimised.

  6. All -1 placeholder values were converted to NA.
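A minimal sketch of several of these cleaning steps, using a toy data frame; the column names (`Company.Name`, `Location`, `Size`, `Revenue`) are assumptions standing in for the real data set:

```r
library(dplyr)
library(stringr)

# Toy stand-in for the job listings data; all column names here are assumptions.
jobs <- tibble(
  Company.Name = c("Acme Corp\n3.8", "Beta Labs\n4.1"),
  Location     = c("Austin, TX", "Stevenage, United Kingdom"),
  Size         = c("51 to 200 employees", "-1"),
  Revenue      = c("$1 to $5 million (USD)", "Unknown / Non-Applicable")
)

jobs_clean <- jobs %>%
  filter(!str_detect(Location, "United Kingdom")) %>%                  # drop non-US listings
  mutate(Company.Name = str_remove(Company.Name, "\\n\\d\\.\\d$"),     # strip attached rating
         Revenue_USD  = str_trim(str_remove(Revenue, fixed("(USD)"))), # unit moves to column name
         across(where(is.character), ~ na_if(.x, "-1")),               # -1 placeholders -> NA
         Size         = factor(Size))                                  # character -> factor
```

The same `across()` pattern can be applied to numeric columns with `na_if(.x, -1)` to cover the numeric -1 placeholders.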

1.2 How many job listings provide salary (intervals) on a per hour basis?

There are 21 job listings that provide salary (intervals) on a per hour basis.
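A sketch of how the per-hour listings could be counted, assuming hourly listings are flagged by the text "Per Hour" in the raw estimate strings (the toy vector below stands in for the real column):

```r
# Toy salary estimate strings; the real column name is an assumption
salary_estimate <- c("$37K-$66K (Glassdoor est.)",
                     "$15-$25 Per Hour (Glassdoor est.)",
                     "$80K-$120K (Employer est.)",
                     "$20-$30 Per Hour (Employer est.)")

# Count listings whose estimate is quoted per hour
n_hourly <- sum(grepl("per hour", salary_estimate, ignore.case = TRUE))
n_hourly  # 2 on the toy data; the full data set gives 21
```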

1.3 We want to investigate what the differences are between the job listings for those under different classification, i.e. business analytics, data analytics and data science. Compare across the classifications using appropriate graphics the:

1.3.1 salary intervals (study the minimum and maximum of the intervals)

Figure 1.1: Maximum and Minimum Salary comparison

Data Scientists have the highest maximum salary limit and also the lowest minimum salary limit, which shows how diverse the Data Scientist job classification can be.

1.3.2 location of the job (study by State)

Figure 1.2: Location of Business Analyst job by state

In the USA, Business Analyst jobs are most popular in the states of Texas and California. The count is significantly lower in New York, which is a very interesting observation.

Figure 1.3: Location of Data Analyst job by state

Compared to Business Analyst jobs, Data Analyst jobs are significantly fewer. Data Analyst jobs are most popular in Texas, California and New York.

Figure 1.4: Location of Data Scientist job by state

The number of jobs for Data Scientists is comparatively higher than for Business and Data Analysts. This was also evident from the bar graph above.

1.3.3 company size

Ratio of different company sizes for Business Analysts

Figure 1.5: Ratio of different company sizes for Business Analysts

Ratio of different company sizes for Data Analysts

Figure 1.6: Ratio of different company sizes for Data Analysts

Ratio of different company sizes for Data Scientists

Figure 1.7: Ratio of different company sizes for Data Scientists

The number of startups (companies with smaller employee counts) is higher for the Business Analyst field compared to the rest, while Data Scientists have more opportunities in larger companies.

1.3.4 Industry

Business Analyst in various Industries

Figure 1.8: Business Analyst in various Industries

Data Analyst in various Industries

Figure 1.9: Data Analyst in various Industries

Data Scientist in various Industries

Figure 1.10: Data Scientist in various Industries

Staffing/Outsourcing and IT Services are the major industries in which these three job classifications are predominant.

1.3.5 Sector

Data Scientist in various Sectors

Figure 1.11: Data Scientist in various Sectors

Data Analyst in various Sectors

Figure 1.12: Data Analyst in various Sectors

Business Analyst in various Sectors

Figure 1.13: Business Analyst in various Sectors

Information Technology and Business Services are the predominant sectors in which these job classifications are required.

1.4 Your friend suspects that if an employer provides a salary range for the job, the salary is large and hence more attractive to potential candidates. Investigate this claim. Your investigation should be supported by graphics.

Maximum Salary vs Rating

Figure 1.14: Maximum Salary vs Rating

This claim seems to be supported by the graph above: job ratings tend to get higher as the maximum salary gets higher.

1.5 Is the location (by State) associated with the salary and/or sector? Show graphics to support your conclusion.

Salary vs State

Figure 1.15: Salary vs State

The salary ranges in California, Texas and New York are comparatively higher than in the rest of the states.

Figure 1.16: Sector vs State

The sector count is higher in Texas and California compared to the rest. This may also be due to the larger number of listings for these two states.

2 Exercise 2

2.1 Answer these questions from the data.

2.1.1 How many teams in the competition?

There are 14 teams in the competition.

2.1.2 How many players?

There are a total of 370 players.

2.1.3 How many rounds in the competition?

There are a total of 7 rounds in the competition.

2.2 The 2020 season was interrupted by COVID, so there was no winning team. Make an appropriate plot of the goals by team and suggest which team might have been likely to win if the season had played out.

As can be observed, the highest goal scorers are the Kangaroos and Fremantle. Therefore, one of these two teams would most likely have won the 2020 season.
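A hedged sketch of the goals-by-team comparison, using a toy stand-in for the player statistics; the column names `team` and `goals` are assumptions:

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the AFLW player statistics (column names are assumptions)
aflw <- tibble(team  = c("Kangaroos", "Kangaroos", "Fremantle", "Carlton"),
               goals = c(10, 8, 15, 5))

goals_by_team <- aflw %>%
  group_by(team) %>%
  summarise(total_goals = sum(goals), .groups = "drop")

# Ordered bar chart: the longest bar marks the likely winner
p <- ggplot(goals_by_team,
            aes(x = reorder(team, total_goals), y = total_goals)) +
  geom_col() +
  coord_flip() +
  labs(x = "Team", y = "Total goals scored in 2020")
```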

2.3 If you were to make a pairs plot of the numeric variables, how many plots would you need to make? (DON’T MAKE THE PLOT!!!)

The data set contains 68 variables, 34 of which are numeric. A pairs plot of p variables draws a p × p grid: p diagonal panels plus p(p − 1)/2 unique scatterplots, each repeated in the upper and lower triangles. However, the jumper id variable is duplicated three times, leaving 32 distinct numeric variables, so the plot would need 32 × 31 / 2 = 496 unique scatterplots (a 32 × 32 = 1024-panel grid if the diagonal and both triangles are counted).
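The count can be checked directly, assuming 34 numeric variables of which two are redundant copies of jumper id:

```r
p <- 34 - 2   # distinct numeric variables after dropping the duplicate jumper id columns
n_plots <- choose(p, 2)   # unique scatterplots: 32 * 31 / 2
n_plots
```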

2.4 Summarise the players, by computing the means for all of the statistics. On this data, one pair of variables has an L-shaped pattern. (See the slides from week 7 if you need a reminder what this shape is.) Use scagnostics to find the pair. Make the plot, report the scagnostic used. Write a sentence to explain the relationship between the two variables, in terms of players skills.

The scagnostics striated and stringy were used to find the L-shaped plot: striated measures regular parallel-line structure in the points, while stringy measures how thread-like (one-dimensional) the pattern is. Together these identified the pair hitouts and bounces.

2.5 Find a pair of variables that exhibit a barrier. Plot it and report the scagnostic used. Write sentence explaining the relationship.

The data appear to have a barrier, in that the values do not extend beyond a certain x or y limit.

2.6 Writing code similar to that in lecture 7B, make an interactive plotly parallel coordinate plot of the scagnostics. You can also refer to the plotly website to work out some of the difficult parts. There are two pieces that are really important to have:

2.6.1 scale on each axis needs to be 0-1, not individual variable range

2.6.2 the text outputted when traces are selected should include the pair of variables with that set of scagnostic values.

# Shiny app: interactive parallel coordinate plot of the scagnostics
library(shiny)
library(plotly)

ui <- fluidPage(
  plotlyOutput("parcoords"),
  verbatimTextOutput("data"))


server <- function(input, output, session) { 
  
  aflw_num <- aflw_scags[,3:15]
  
output$parcoords <- renderPlotly({ 
  dims <- Map(function(x, y) {
      list(values = x,
           range = range(0,1), 
           label = y)
    
    }, aflw_num, 
    names(aflw_num), 
    USE.NAMES = FALSE)
  
    plot_ly(type = 'parcoords', 
            dimensions = dims, 
            source = "pcoords") %>% 
      layout(margin = list(r = 30)) %>%
      event_register("plotly_restyle")
})

ranges <- reactiveValues()
  observeEvent(event_data("plotly_restyle", 
                          source = "pcoords"),
  {
    d <- event_data("plotly_restyle", 
                    source = "pcoords")
    
    dimension <- as.numeric(stringr::str_extract(names(d[[1]]),"[0-9]+"))
    
    
    if (!length(dimension)) return()
    
    dimension_name <- names(aflw_num)[[dimension + 1]]
    
    info <- d[[1]][[1]]
    ranges[[dimension_name]] <- if (length(dim(info)) == 3) {
      lapply(seq_len(dim(info)[2]), function(i) info[,i,])
    } else {
      list(as.numeric(info))
    }
  })
  
  aflw_selected <- reactive({
    keep <- TRUE
    for (i in names(ranges)) {
      range_ <- ranges[[i]]
      keep_var <- FALSE
      for (j in seq_along(range_)) {
        rng <- range_[[j]]
        keep_var <- keep_var | dplyr::between(aflw_scags[[i]], 
                                              min(rng), max(rng))
      }
      keep <- keep & keep_var
    }
    aflw_scags[keep, ]
  })
  
  output$data <- renderPrint({
    tibble::as_tibble(aflw_selected())
  })
}


shinyApp(ui, server)

2.6.3 Summarise the relationships between the scagnostics, in terms of positive and negative association, outliers, clustering.

Clumpy and convex have relatively low values compared to the rest. There appear to be outliers in the convex, skinny and clumpy scagnostics. Sparse and skewed show clumpiness, while the others are more spread out.

2.6.4 Pairs that have high values on convex (non-zero) tend to have what type of values on outlying, stringy, striated, skewed, skinny and splines?

Outlying: 0.0 to 0.2; Stringy: 0.6; Striated: 0.2 to 0.8; Skewed: 0.7; Skinny: 0.4; Splines: 0.5.

2.6.5 Pairs of variables that have high values on skewed tend to have what type of values on outlying, stringy, striated, and splines?

Outlying: > 0.4; Stringy and Striated: > 0.8; Splines: 0.

2.6.6 Identify one pair of variables that might be considered to have an unusual combination of scagnostic values, ie is an outlier in the scagnostics.

Clumpy and Convex

3 Exercise 3

3.1 Make a plot (or two) of the data that provides a suitable comparison between the pageviews of the two groups relative to time. Write a sentence comparing the two groups.

The pageviews of the control and experiment groups are similar in value over time.

3.2 Make an appropriate transformation of the data, and plot, to examine whether there is a difference in Clicks, summarising what you learn.

The number of clicks in the two groups is also similar in value.
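One way such a comparison could be made, sketched with toy data and an assumed layout (`day`, `group`, `clicks` are invented column names): a log10 scale on the click counts keeps both traces directly comparable.

```r
library(ggplot2)

# Toy stand-in for the A/B-test data (column names are assumptions)
ab <- data.frame(
  day    = rep(1:5, 2),
  group  = rep(c("control", "experiment"), each = 5),
  clicks = c(120, 135, 110, 150, 140, 118, 130, 115, 148, 138)
)

p <- ggplot(ab, aes(x = day, y = clicks, colour = group)) +
  geom_line() +
  scale_y_log10() +                 # log transformation of the click counts
  labs(y = "Clicks (log10 scale)")
```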

3.3 Repeat (b) to check if there is a difference between the groups in Enrollments, summarising what you learn.

Again, there is not much difference between the two groups; however, the number of enrollments in November is significantly lower than in October.

3.4 Repeat (b) to check if there is a difference between the groups in Payments, summarising what you learn.

The number of payments is also similar between the two groups, and payments were higher in October.

3.4.1 The variables can be considered to monitor the flow of visitor traffic to the site. Pageviews is the number of visitors to the site, and some of these will click on the page. From those that click on the site some will enrol, and some of those that enrol will continue to pay for the service. Make a suitable plot to examine the flow of traffic, so that you can compare the flow between the two groups.

The flow of traffic is reduced significantly at each stage, from pageviews to clicks to enrollments to payments.

3.4.2 Check what you learn about the difference in flow of traffic between control and experiment using a lineup plot.

4 Exercise 4

4.1 Conduct a two-sample t-test and a Wilcoxon rank sum test to compare the mean cholesterol chol_red between the margarine brands after 4 weeks. What graphics best compares these measurements across the brands? What do you conclude from the results of the tests and your graphics?

## 
##  Two Sample t-test
## 
## data:  chol_red by Margarine
## t = -2.5186, df = 16, p-value = 0.0228
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.29671838 -0.02550384
## sample estimates:
## mean in group A mean in group B 
##       0.4855556       0.6466667
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  chol_red by Margarine
## W = 16, p-value = 0.03388
## alternative hypothesis: true location shift is not equal to 0
## [1] -0.1611111

Both tests give p-values below 0.05 (t-test: 0.0228; Wilcoxon: 0.0339), so there is evidence that the mean cholesterol reduction differs between the two margarine brands, with brand B showing the larger reduction (0.647 vs 0.486). Side-by-side boxplots (or dotplots, given the small sample sizes) of chol_red by brand best compare the measurements across the brands.